Data leakage
The training-test data split is one of the most critical processes in building a machine learning model.
If the model sees the test data during training ("data leakage"), it can overfit and produce seemingly high performance on the test set while failing to generalize to unseen data beyond it. Thus preventing data leakage, or ensuring "data hygiene", is a critical step for ensuring the generalizability of the model.
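A minimal sketch of one common leakage pitfall, assuming scikit-learn: fitting a preprocessing step (here a scaler) on the full dataset lets test-set statistics leak into training. The hygienic version splits first and fits all preprocessing inside the training data only.

```python
# Sketch of preprocessing leakage vs. a leakage-free pipeline (scikit-learn).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # toy data for illustration
y = (X[:, 0] > 0).astype(int)

# Leaky: the scaler sees test rows before the split.
# X_scaled = StandardScaler().fit_transform(X)
# X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# Hygienic: split first, then fit preprocessing on the training set only.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)            # scaler is fitted on X_train only
print("test accuracy:", model.score(X_test, y_test))
```

Wrapping preprocessing in a pipeline ensures the same hygiene carries over to cross-validation, where each fold refits the scaler on its own training portion.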
Kapoor2022leakage studies various data leakage patterns.
Pre-registration for Predictive Modeling may be a good practice for preventing such leakage.
LLMs
Aiyappa2023Can raises the issue of evaluating closed LLMs. For a closed LLM, it is difficult to know whether a test dataset, if it is public, has been used to train the model. Furthermore, researchers may accidentally leak the test data to an LLM that continuously trains on user input.
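A minimal sketch of one way to probe for such contamination, assuming a locally loadable model via Hugging Face transformers (GPT-2 here is only a stand-in for the model under test): prompt the model with the prefix of a public test example and check whether it reproduces the held-out continuation verbatim, which would suggest the example appeared in its training data.

```python
# Sketch of a verbatim-completion (memorization) probe for test-set
# contamination. GPT-2 is an assumed stand-in; the test_example string
# is hypothetical, for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

test_example = "The quick brown fox jumps over the lazy dog."
prefix, target = test_example[:20], test_example[20:]

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,                     # greedy decoding: deterministic
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# A verbatim match of the held-out continuation is evidence of memorization.
print("memorized?", target.strip() in completion)
```

For API-only closed models the same probe can be run through the provider's completion endpoint, though absence of verbatim recall does not rule out contamination.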
See also Training data leakage and memorization in language models